feat(agent-wiki): add compare-outcomes pass for contrastive guidelines by vinodmut · Pull Request #274 · AgentToolkit/altk-evolve

vinodmut · 2026-06-22T20:21:51Z

What this adds

A new pass in the agent-wiki ingest pipeline — agent-wiki-compare-outcomes — that derives contrastive guidelines by comparing successful vs failed trajectories for the same (or similar) task, rather than mining rules from a single trajectory.

Every other pass in the pipeline (summarize / extract / synthesize) learns from one trajectory at a time. This pass learns from the contrast: a rule is only promoted when it's backed by a failed path, a successful path, and concrete trajectory evidence (task wording, observed tool/API calls, transcript/doc snippets). It can LLM-judge success/failure straight from the normalized transcript, so it does not depend on benchmark-specific outcome labels.
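The promotion rule above can be sketched as a small gate. This is only an illustration of the stated criteria: the `Candidate` type, its field names, and the sample rule text are hypothetical and are not taken from `compare_outcomes.py`.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """Hypothetical shape of a contrastive rule candidate (not the PR's real type)."""

    rule: str
    failed_source_ids: list[str] = field(default_factory=list)
    successful_source_ids: list[str] = field(default_factory=list)
    # Task wording, observed tool/API calls, transcript/doc snippets.
    evidence: list[str] = field(default_factory=list)


def classify(candidate: Candidate) -> str:
    """Return "rule" only when both sides of the contrast and evidence exist."""
    has_contrast = bool(candidate.failed_source_ids) and bool(
        candidate.successful_source_ids
    )
    if has_contrast and candidate.evidence:
        return "rule"
    return "hypothesis"


# Illustrative data only; the IDs and rule text are made up.
strong = Candidate(
    rule="Read the API doc before calling the endpoint.",
    failed_source_ids=["trace-002"],
    successful_source_ids=["trace-001"],
    evidence=["successful run looked up the API doc before the endpoint call"],
)
weak = Candidate(rule="Prefer search over list.", successful_source_ids=["trace-003"])

print(classify(strong))
print(classify(weak))
```

A candidate with only one side of the contrast stays a hypothesis no matter how much evidence it carries.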

Extends the agent-wiki exploration merged in #268; related to the offline extraction/consolidation idea in #256.

Changes

explorations/agent-wiki/skills/
├── agent-wiki-compare-outcomes/
│   ├── SKILL.md                    new — the contrastive-comparison pass
│   └── scripts/compare_outcomes.py new — self-contained (stdlib only) evidence-pack builder
└── agent-wiki-ingest/SKILL.md      wired in as a conditional Step 4.5

- The skill (SKILL.md): a 3-step workflow. Build an evidence pack over normalized trajectories (grouped by task_id, success/failure judged or stored), inspect candidate rules, and promote only strong ones (one failed + one successful run in the same group, a task-action tool/API or workflow difference, source IDs for both sides). Weak candidates stay hypotheses, not rules.
- The script (compare_outcomes.py): groups traces, contrasts success/failed runs, extracts tool/API calls + transcript evidence, optionally LLM-judges outcomes (--judge-outcomes never|missing|always), and emits an analysis JSON + Markdown (and optional render-ready guideline entities). Stdlib-only, no repo-internal deps.
- Ingest wiring: added as Step 4.5 (conditional), running after synthesize and before consolidate, so contrastive guidelines can participate in clustering. Updated the description, the subagent list, the pipeline diagram, and added the step section. Skips cleanly when the corpus has no success/failure contrast.
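A minimal sketch of the grouping step and the `--judge-outcomes` modes described above. The trace dict keys (`id`, `task_id`, `success`) and both function signatures are assumptions for illustration. In particular, reading `missing` as "judge only when no outcome is stored" is an interpretation of the flag name, not something the PR text spells out.

```python
from collections import defaultdict
from typing import Optional


def should_judge(mode: str, stored_success: Optional[bool]) -> bool:
    """Decide whether to LLM-judge a trace for a given --judge-outcomes mode."""
    if mode == "never":
        return False
    if mode == "always":
        return True
    if mode == "missing":
        # Assumed meaning: judge only traces that carry no stored outcome.
        return stored_success is None
    raise ValueError(f"unknown judge mode: {mode!r}")


def contrast_groups(traces: list[dict]) -> dict[str, dict[str, list[str]]]:
    """Group trace IDs by task_id and keep only groups with both outcomes."""
    groups: dict[str, dict[str, list[str]]] = defaultdict(
        lambda: {"success": [], "failed": []}
    )
    for trace in traces:
        if trace["success"] is None:
            continue  # unknown outcome: cannot take part in a contrast
        side = "success" if trace["success"] else "failed"
        groups[trace["task_id"]][side].append(trace["id"])
    return {
        task: sides
        for task, sides in groups.items()
        if sides["success"] and sides["failed"]
    }


# Made-up traces: only task "t1" has both a success and a failure.
traces = [
    {"id": "a", "task_id": "t1", "success": True},
    {"id": "b", "task_id": "t1", "success": False},
    {"id": "c", "task_id": "t2", "success": True},
    {"id": "d", "task_id": "t3", "success": None},
]
print(contrast_groups(traces))
```

An empty result from `contrast_groups` corresponds to the "skips cleanly" case: with no group holding both outcomes there is nothing to compare.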

Scope

This ports only the compare-outcomes capability + its ingest wiring from the appworld-agent-wiki-experiment branch. That branch also bundled unrelated changes (a consolidate "mine step" rewrite, a synthesize faithfulness rule, a run_agent_wiki_skill_pass.py helper) — those are intentionally not included here, to keep this PR focused.

Verification

ruff check + ruff format --check: clean.
mypy .: clean (the script carries the # mypy: ignore-errors header used by every sibling exploration/reference script).
detect-secrets: passes.
Smoke test: ran the script end-to-end on a success/failure trajectory pair — it groups by task_id, contrasts the two runs, and emits the analysis without error.

No changes outside explorations/agent-wiki/.

Summary by CodeRabbit

New Features
- Introduced optional "compare-outcomes" pipeline stage to generate evidence-backed contrastive guidelines by comparing successful and failed agent trajectories within the same task context.
Documentation
- Added comprehensive documentation for the compare-outcomes skill, including operational workflows and decision criteria.
- Updated design documentation and ingest workflows to reflect the new conditional pipeline stage.

Adds a new pipeline skill, agent-wiki-compare-outcomes, that derives *contrastive* guidelines by comparing successful vs failed trajectories for the same/similar task, rather than mining rules from a single trajectory. It LLM-judges success/failure from the normalized transcript (no dependency on benchmark-specific outcome labels) and grounds each rule in evidence (task wording, observed tool/API calls, transcript/doc snippets). The bundled compare_outcomes.py is self-contained (stdlib only).

Wires it into the ingest orchestrator as a conditional Step 4.5 (after synthesize, before consolidate): the description, subagent list, pipeline diagram, and a new step section that spawns one agent-wiki-compare-outcomes subagent over the corpus when there's a success/failure contrast, renders any strong contrastive guidelines, and skips cleanly when there's no contrast.

Documents the new pass in the overview docs so it's discoverable: the README skills tree, and design.md's pipeline diagram, stage table, ingest narrative, and a short "learning from contrast" rationale.

Ported from the appworld-agent-wiki-experiment branch; scoped to just the new skill + its ingest wiring (the branch's separate consolidate "mine step" and synthesize changes are intentionally not included).

Builder/CI conventions followed: file-local `# mypy: ignore-errors` header matching sibling scripts.

coderabbitai · 2026-06-22T20:22:07Z


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9fc0d59a-9234-4a08-a5cb-2372bc791d5c

📥 Commits

Reviewing files that changed from the base of the PR and between b9c8d44 and 3dfb8e2.

📒 Files selected for processing (1)

explorations/agent-wiki/skills/agent-wiki-compare-outcomes/scripts/compare_outcomes.py

📝 Walkthrough


Adds an agent-wiki-compare-outcomes skill to the agent-wiki exploration. This includes a new compare_outcomes.py Python script (~897 lines) that groups agent trajectory traces by task, extracts tool/API evidence, optionally judges outcomes via an LLM, derives contrastive guideline candidates comparing successful vs. failed runs, and outputs JSON, Markdown, and guideline payloads. Design docs and the agent-wiki-ingest orchestrator are updated to incorporate this as a conditional step 4.5 in the pipeline.

Changes

agent-wiki compare-outcomes skill

| Layer / File(s) | Summary |
| --- | --- |
| Pipeline design and README documentation `explorations/agent-wiki/README.md`, `explorations/agent-wiki/docs/design.md` | README directory layout and design.md pipeline diagram, stage table, and explanatory paragraphs updated to insert `agent-wiki-compare-outcomes` as a cross-corpus conditional stage running before consolidation. |
| compare-outcomes skill specification `explorations/agent-wiki/skills/agent-wiki-compare-outcomes/SKILL.md` | New SKILL.md defines the full workflow: running `compare_outcomes.py`, evidence grouping, judging modes (`never`/`missing`/`always`), candidate promotion criteria requiring both success and failure evidence, provenance rendering, and guardrails against over-generalization. |
| Ingest orchestrator integration `explorations/agent-wiki/skills/agent-wiki-ingest/SKILL.md` | `agent-wiki-ingest` updated to reference `-compare-outcomes` as a sub-skill, inserts step 4.5 in the pipeline enumeration, and adds a full conditional execution section covering judging options, promotion criteria, temp-file rendering, and ordering constraints. |
| Script: CLI, TraceSummary, evidence extraction, judgment cache `explorations/agent-wiki/skills/agent-wiki-compare-outcomes/scripts/compare_outcomes.py` (lines 1–215) | Adds CLI argument parsing, trace file iteration, `TraceSummary` dataclass, `summarize_trace` (extracting task text, stored/judged success, tool calls, tool docs, API call logs), `should_judge`, and disk-backed judgment cache helpers. |
| Script: LLM outcome judging `explorations/agent-wiki/skills/agent-wiki-compare-outcomes/scripts/compare_outcomes.py` (lines 217–463) | Implements `judge_outcome`/`build_judge_prompt` with cache-keyed OpenAI calls, strict JSON response parsing with fallback, `compact_transcript`/`truncate_middle`, and all evidence extraction/normalization helpers (`extract_task_text`, `extract_code_tool_calls`, `extract_tool_docs`, `extract_api_calls`, URL/tool normalization, `group_key`). |
| Script: candidate derivation and NLP scoring `explorations/agent-wiki/skills/agent-wiki-compare-outcomes/scripts/compare_outcomes.py` (lines 464–735) | Implements `compare_group`, `derive_candidates` (tool-selection), `derive_intensity_candidates` (call-count-based), `dedupe_candidates`, tool family bucketing, `semantic_alignment`/`tokenize`/`distinctive_terms`, confidence scoring, and rule drafting helpers. |
| Script: output serialization and guideline rendering `explorations/agent-wiki/skills/agent-wiki-compare-outcomes/scripts/compare_outcomes.py` (lines 742–897) | Adds `trace_to_json`, `render_markdown`, `render_guideline_payload` (filtered by confidence threshold and tool-description presence), `stable_terms`, `render_contrastive_rule_content`/`render_contrastive_trigger`, `clean_sentence`, and the module entry point. |
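Two of the helpers named in the table, middle truncation of long transcripts and strict JSON parsing with a fallback, lend themselves to a short sketch. The bodies below are assumptions about how such helpers could work: only the name `truncate_middle` appears in the walkthrough, while `parse_judgment`, the marker text, and the `success`/`reason` keys are hypothetical.

```python
import json
import re


def truncate_middle(
    text: str, max_chars: int, marker: str = "\n...[truncated]...\n"
) -> str:
    """Keep the head and tail of `text`, dropping the middle if it is too long."""
    if len(text) <= max_chars:
        return text
    keep = max_chars - len(marker)
    if keep <= 0:
        return text[:max_chars]  # budget too small to fit the marker at all
    head = (keep + 1) // 2
    tail = keep // 2
    # Guard tail == 0: text[-0:] would return the whole string.
    return text[:head] + marker + (text[-tail:] if tail else "")


def parse_judgment(raw: str) -> dict:
    """Parse a judge reply as strict JSON, else fall back to the first {...} span."""
    unparseable = {"success": None, "reason": "unparseable judge reply"}
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
        if match is None:
            return unparseable
        try:
            parsed = json.loads(match.group(0))
        except json.JSONDecodeError:
            return unparseable
    if not isinstance(parsed, dict) or not isinstance(parsed.get("success"), bool):
        return {"success": None, "reason": "judge reply missing boolean 'success'"}
    return {"success": parsed["success"], "reason": str(parsed.get("reason", ""))}


print(len(truncate_middle("x" * 1000, 100)))
print(parse_judgment('Sure! {"success": false, "reason": "404"}'))
```

Returning `success: None` on an unparseable reply keeps an undecided trace out of any contrast instead of guessing a label for it.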

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

visahak

Poem

🐇 Hop, hop! I sniff each trace,
Success and failure, face to face.
With evidence I draft my rule,
No phantom tools — that ain't cool!
Contrastive wisdom, neat and tight,
The wiki grows more sharp tonight. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
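For a rough local check of the figure reported above, docstring coverage can be approximated with the standard `ast` module. This is a sketch under assumptions: the review does not state how CodeRabbit counts (for example, whether module docstrings or nested functions are included), so the number may differ from the bot's. The helper name `docstring_coverage` is hypothetical.

```python
import ast


def docstring_coverage(source: str) -> float:
    """Percentage of functions and classes in `source` that have a docstring."""
    tree = ast.parse(source)
    nodes = [
        node
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]
    if not nodes:
        return 100.0  # nothing to document
    documented = sum(1 for node in nodes if ast.get_docstring(node) is not None)
    return 100.0 * documented / len(nodes)


# Made-up sample: one documented function, one undocumented.
sample = '''
def documented():
    """Has a docstring."""
    return 1

def undocumented():
    return 2
'''
print(docstring_coverage(sample))
```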

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat(agent-wiki): add compare-outcomes pass for contrastive guidelines' directly and clearly summarizes the main change: introducing a new compare-outcomes pass to generate contrastive guidelines from successful vs. failed trajectories. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@explorations/agent-wiki/skills/agent-wiki-compare-outcomes/scripts/compare_outcomes.py`:
- Around line 261-276: The client.chat.completions.create() call lacks an
explicit timeout configuration, which could cause the script to block
indefinitely if the API becomes unresponsive during batch processing. Add a
timeout to prevent excessive blocking: either add a timeout parameter when
instantiating the OpenAI client (e.g., timeout=60.0), or use the with_options()
method on the client immediately before calling chat.completions.create() to
apply the timeout at the request level. Choose whichever approach fits your
codebase structure best.
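The two options in the finding could look roughly like this with the `openai` Python package (v1 or later). This is a sketch, not the code in `compare_outcomes.py`: the function names, the constant, and the parameter layout are hypothetical, and the import is deferred so that only `build_judge_client` needs the package installed.

```python
from typing import Any

JUDGE_TIMEOUT_SECONDS = 60.0  # value suggested in the review comment


def build_judge_client(api_key: str, timeout: float = JUDGE_TIMEOUT_SECONDS) -> Any:
    """Option 1: client-level timeout, applied to every request the client makes."""
    from openai import OpenAI  # lazy import; requires the `openai` package (v1+)

    return OpenAI(api_key=api_key, timeout=timeout)


def judge_once(
    client: Any,
    model: str,
    messages: list[dict],
    timeout: float = JUDGE_TIMEOUT_SECONDS,
) -> Any:
    """Option 2: request-level timeout via with_options() right before the call."""
    return client.with_options(timeout=timeout).chat.completions.create(
        model=model,
        messages=messages,
    )
```

The client-level form covers every call site in a batch loop with one change. The request-level form is useful when a single call needs a different budget from the rest.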


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d9b9045a-28f8-42e2-8491-cb9edc3918d0

📥 Commits

Reviewing files that changed from the base of the PR and between 4d5b285 and b9c8d44.

📒 Files selected for processing (5)

explorations/agent-wiki/README.md
explorations/agent-wiki/docs/design.md
explorations/agent-wiki/skills/agent-wiki-compare-outcomes/SKILL.md
explorations/agent-wiki/skills/agent-wiki-compare-outcomes/scripts/compare_outcomes.py
explorations/agent-wiki/skills/agent-wiki-ingest/SKILL.md


coderabbitai Bot reviewed Jun 22, 2026


Comment thread explorations/agent-wiki/skills/agent-wiki-compare-outcomes/scripts/compare_outcomes.py

fix(agent-wiki): add timeout to compare-outcomes LLM judge call

3dfb8e2

Addresses the CodeRabbit review finding to add a timeout to the LLM API call. A 60s client-level timeout prevents the batch judge loop from blocking indefinitely if the API becomes unresponsive.

vinodmut requested review from illeatmyhat, jayaramkr and visahak June 22, 2026 22:21

vinodmut wants to merge 2 commits into AgentToolkit:main from vinodmut:explorations/agent-wiki-compare-outcomes
